Abstract Among many existing distance measures for time series data,
Dynamic Time Warping (DTW) distance has been recognized as one of the most
accurate and suitable distance measures due to its flexibility in sequence
alignment. However, DTW distance calculation is computationally intensive.
Especially in very large time series databases, sequential scan through the
entire database is definitely impractical, even with random access that
exploits some index structures, since high dimensionality of time series
data incurs extremely high I/O cost. More specifically, a sequential
structure consumes high CPU but low I/O costs, while an index structure
requires low CPU but high I/O costs. In this work, we therefore propose a
novel indexed sequential structure called TWIST (Time Warping in Indexed
Sequential sTructure) which benefits from both sequential access and index
structure. When a query sequence is issued, TWIST calculates lower bounding
distances between a group of candidate sequences and the query sequence,
and then identifies the data access order in advance, hence reducing a
great number of both sequential and random accesses. Impressively, our
indexed sequential structure achieves significant speedup in a querying
process by a few orders of magnitude. In addition, our method shows
superiority over existing rival methods in terms of query processing time,
number of page accesses, and storage requirement with no false dismissal
guaranteed.
1 Introduction



Dynamic Time Warping (DTW) distance (Berndt and Clifford, 1994;
Ratanamahatana and Keogh, 2004, 2005; Sakurai et al., 2007) has been known
as one of the best distance measures (Ding et al., 2008; Keogh and Kasetty,
2003) suited for time series domain over the traditional Euclidean distance
because DTW distance has much more flexibility in sequence alignment. In
addition, DTW distance tries to find the best warping, while Euclidean
distance is calculated in one-to-one manner, as shown in Figure 1. However,
DTW distance has a major drawback, i.e., it requires extremely high
computational cost, especially when DTW distance is used in similarity
search problems, including top-k query. More specifically, in top-k
querying problem, after a query sequence has been issued, a set of k
candidate sequences most similar to the query sequence ranked by DTW
distance is returned. Traditionally, the naive approach needs to calculate
DTW distances for all candidate sequences. As a result, its query
processing time mainly depends on distance calculation and the number of
data accesses.






Fig. 1 The comparison of sequence alignments between a) Euclidean distance and b) DTW 
distance 



So far, many speedup techniques have been proposed including lower
bounding functions and index structures. Lower bounding functions (Yi et
al., 1998; Kim et al., 2001; Keogh and Ratanamahatana, 2005; Zhu and Shasha,
2003; Sakurai et al., 2005), whose complexity is typically much lower than
that of a DTW distance measure, are used for a lower bounding distance
calculation which guarantees that DTW distance must be equal to or larger
than the lower bounding distance. Additionally, in sequential scan, before
calculating DTW distance between the query sequence and a candidate
sequence, a lower bounding function is utilized to approximate and prune
off the candidate sequence which has larger lower bounding distance than
the current best-so-far distance. And in indexing, the lower bounding
distance is also used to guide the similarity search. Currently, many lower
bounding functions have been proposed to reduce computational costs
including LB_Yi (Yi et al., 1998), LB_Kim (Kim et al., 2001), LB_Keogh
(Keogh and Ratanamahatana, 2005), LB_PAA (Keogh and Ratanamahatana, 2005),
LB_NewPAA (Zhu and Shasha, 2003), and LBS (Sakurai et al., 2005). It has
been widely known that LB_Keogh and LBS are among the most efficient lower
bounding functions, where LB_Keogh has lower time complexity, while LBS has
tighter bound.

Besides lower bounding functions, various index structures for DTW
distance have been proposed to guide the search to access only some parts
of the database. In other words, the search result is returned while only a
small portion of the database is accessed for distance calculation, i.e.,
when querying, the index structure determines which parts of the database
are likely to contain answers, and then the raw data on disk are randomly
accessed. Generally, this index structure should be small enough to fit in
main memory. Currently, two exact indexing approaches are typically used,
i.e., GEMINI framework with LB_PAA (Keogh and Ratanamahatana, 2005), and a
more recent approach, FTW indexing (Sakurai et al., 2005). Note that the
exact indexing returns a set of querying results with no false dismissal
guaranteed; in other words, the best answers must be included in the
results. GEMINI framework (Faloutsos et al., 1994) typically utilizes the
multi-dimensional tree, e.g., R*-tree (Beckmann et al., 1990), as an index
structure, while FTW indexing stores indices in a flat file. However,
current indexing techniques are burdened with huge amount of I/O cost since
random access to the database is typically 5 to 10 times slower than the
sequential access (Weber et al., 1998). Therefore, indexing is efficient
only when less than 20% of raw data sequences are accessed on average.
Current indexing techniques still incur large I/O overheads, which makes
them unsuitable for massive databases.

In this work, we propose a novel index structure and access method under
DTW distance called TWIST (Time Warping in Indexed Sequential sTructure).
TWIST utilizes advantages from both sequential structure and index
structure, i.e., low I/O and low CPU costs. Instead of randomly accessing
the raw time series data like other indexing techniques, TWIST separates
and stores a collection of time series data in sequential structures or
flat files. For each file, TWIST generates a representative sequence
(called an envelope) and stores this sequence in an index structure. When a
query sequence is issued, a lower bounding distance is calculated for each
envelope using our newly proposed lower bounding function for a group of
sequences (LBG). The lower bounding distance between an envelope and a
query sequence guarantees that the DTW distance between the query sequence
and each and every candidate sequence under this envelope is never smaller
than this lower bounding distance. Additionally, if the lower bounding
distance is larger than the best-so-far distance, no access to the
sequences within the envelope is needed; otherwise, every sequence in the
envelope is sequentially accessed for DTW distance calculation.

We evaluate our proposed method, TWIST, comparing with the current
best approaches, i.e., FTW indexing and sequential scan with LB_Keogh lower
bounding function. As will be demonstrated, TWIST prunes off a large number
of candidate sequences and is much faster than the rival methods by a few
orders of magnitude. Furthermore, when the size of databases exponentially
increases, our query processing time only grows linearly.

The rest of the paper is organized as follows. Section 2 provides literature
reviews of related work in speeding up similarity search under DTW distance.
Section 3 gives background on DTW distance, global constraints, and lower
bounding functions. In Section 4, our proposed index structure, TWIST, its
access method, and novel proposed lower bounding distance functions are
described. We show the superiority of TWIST over the best existing methods
in Section 5. Finally, we conclude our work and provide the direction of
future research.



2 Related Work 



Time series data mining has long been studied. Since the Dynamic Time
Warping (DTW) distance measure (Berndt and Clifford, 1994) was introduced
to the data mining community (Keogh and Kasetty, 2003; Loh et al., 2004;
Wang et al., 2006; Vlachos et al.; Bagnall et al., 2006; Lin et al., 2007),
it has shown superior similarity matching over the traditional Euclidean
distance due to its great flexibility in sequence alignment. Specifically,
DTW distance utilizes dynamic programming to find the optimal warping path
and calculate the distance between two time series sequences.
Unfortunately, to calculate DTW distance, exhaustive computation is
generally required. In addition, since DTW distance does not qualify as a
distance metric, neither distance-based (Ciaccia et al., 1997; Yianilos,
1993) nor spatial-based (Berchtold et al., 1996; Guttman, 1984; Beckmann et
al., 1990) index structures can be used efficiently in similarity search
under DTW distance.

Therefore, various lower bounding functions and indexing techniques for
DTW distance have been proposed to resolve these problems. Yi et al. (Yi et
al., 1998) first propose a lower bounding function, LB_Yi, using two
features of a time series sequence, i.e., the minimum and maximum values.
LB_Yi creates an envelope over a query sequence from these minimum and
maximum values, and then the distance is computed from the summation of
areas between an envelope and a candidate sequence, as shown in Figure 2a).
Instead of using only two features, Kim et al. (Kim et al., 2001) suggest
two additional features, i.e., the first and the last values of the
sequence. LB_Kim then calculates distance from the tuples of a query
sequence and a candidate sequence, as shown in Figure 2b). Although these
two lower bounding functions only require small time complexity, the use of
LB_Yi and LB_Kim is not practical since their lower bounding distances
cannot prune off much of the DTW distance calculations.
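
As a rough illustration of the envelope idea behind LB_Yi, the following
Python sketch sums the parts of a candidate sequence that fall outside the
range spanned by the minimum and maximum values of the query. It is a
simplified reading of the description above and not necessarily the exact
formulation of Yi et al.; the function name `lb_yi_sketch` and the parameter
`p` are assumptions introduced here for illustration.

```python
def lb_yi_sketch(query, candidate, p=2):
    """Simplified LB_Yi-style bound (an illustrative assumption, see lead-in).

    Every candidate point must be aligned with at least one query point, so a
    candidate value above max(query) or below min(query) contributes at least
    its gap to that range.
    """
    q_max, q_min = max(query), min(query)
    total = 0.0
    for c in candidate:
        if c > q_max:            # part of the candidate above the envelope
            total += (c - q_max) ** p
        elif c < q_min:          # part of the candidate below the envelope
            total += (q_min - c) ** p
    return total ** (1.0 / p)
```

Because only two features of the query are used, the bound is cheap but
loose: any candidate that stays inside the query's value range gets a bound
of zero and cannot be pruned.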
Keogh et al. propose a tighter lower bounding function, LB_Keogh,
utilizing global constraints (Sakoe and Chiba, 1978; Itakura, 1975;
Ratanamahatana and Keogh, 2004), which are generally used to limit the
scope of warping in distance matrix to prevent undesirable paths. In
addition, various well-known global constraints have been proposed, e.g.,
Sakoe-Chiba band (Sakoe and Chiba, 1978), Itakura Parallelogram (Itakura,
1975), and Ratanamahatana-Keogh (R-K) band (Ratanamahatana and Keogh,
2004). To be more illustrative, Figure 3 shows different shapes of global
constraints. Note that R-K band is an arbitrary-shaped constraint which can
represent any bands by using only a single one-dimensional array. LB_Keogh
first creates an envelope over a query sequence according to the shape and
size of the global constraint. Its lower bounding distance then is an area
between the envelope and a candidate sequence, as shown in Figure 2c).

In addition, Keogh et al. also propose an indexing technique which
utilizes a discretized version of their lower bounding function, LB_PAA. In
order to create an index structure, they reduce dimensions of each time
series sequence using the Piecewise Aggregate Approximation (PAA) technique
(Keogh et al., 2001), and store the reduced sequence in a multi-dimensional
index structure such as R*-tree (Beckmann et al., 1990). Each leaf node of
the tree, stored on disk, contains a collection of segmented sequences,
where each sequence points to its raw time series data. In querying
process, an envelope of the query sequence is created and discretized.
Then, each MBR (Minimum Bounding Rectangle) of R*-tree is retrieved and
compared with the segmented query sequence until the leaf node is retrieved
in random-access manner. Next, all discretized candidate sequences in the
leaf node undergo lower bounding distance calculation using LB_PAA. If the
lower bounding distance from LB_PAA is smaller than the best-so-far
distance, the raw time series sequence is also retrieved by random access,
and the distances are determined using LB_Keogh and DTW distance,
respectively. It is clear that Keogh et al.'s index structure requires too
many random accesses even when the database size increases only slightly.
Note that although Zhu et al. later propose a tighter lower bounding
function, LB_NewPAA (Zhu and Shasha, 2003), the index structure still
consumes high I/O cost.

Sakurai et al. (Sakurai et al., 2005) propose a new lower bounding function,
LBS (Lower Bounding distance measure with Segmentation), which requires
a quadratic time complexity $O(n^2/t^2)$, where n is the length of time
series and t is the size of a segment. To calculate lower bounding distance,
LBS first quantizes a query sequence and a candidate sequence into sequences
of segments. Each segment contains two values that indicate the maximum and
minimum among the data points in the segment. Then, dynamic programming
is used to find the optimal distance between these two segmented sequences,
and the resulting distance is taken as a lower bounding distance of DTW
distance. Despite the fact that LBS requires larger computational time and
space than those of LB_PAA at the same resolution, LBS achieves much
tighter lower bounding distance. The example of segmented sequence is shown
in Figure 4.







Fig. 4 Illustration of segmented sequences with various resolutions 

To use LBS in indexing, Sakurai et al. proposed an index structure (FTW
indexing) which stores pre-calculated segmented sequences. For each time
series data, a set of segmented sequences is generated by varying segment
sizes from the coarsest to the finest, and the segmented sequence is stored
in a flat file with a pointer to the raw time series data. In querying
process, a query sequence is segmented, and then the index structure is
sequentially accessed and the lower bounding distance to each pre-segmented
candidate sequence is calculated. If the lower bounding distance is smaller
than the best-so-far distance, the raw time series data is retrieved in
random access manner. However, the main drawback of FTW is that the size of
the index structure is approximately twice the size of the raw time series
database. Therefore, this index structure is impractical for massive time
series databases since the entire index file, which is larger than the raw
data, must be read once for every single query, causing large I/O
overheads.

It is worth noting that the existing index structures are not designed for
massive databases. For example, since LB_PAA utilizes PAA to reduce the
number of dimensions, as the database size increases, its pruning power
significantly decreases; therefore, a huge number of sequences must be
accessed for distance calculation. Similarly for FTW indexing, since the
index is about twice the size of the raw data, the index grows twice as
fast as the database itself. In Section 5, our experiments will demonstrate
that when the database exceeds the size of the main memory, our proposed
method significantly outperforms these rival methods.



3 Background 

Before describing our proposed method, TWIST, we provide some background
knowledge, i.e., Dynamic Time Warping distance (DTW), global constraints,
and lower bounding distance functions including LB_Keogh and LBS.
3.1 Dynamic Time Warping Distance

Dynamic Time Warping (DTW) distance (Berndt and Clifford, 1994;
Ratanamahatana and Keogh, 2004, 2005) is a well-known shape-based
similarity measure. It uses a dynamic programming technique to find an
optimal warping path between two time series sequences. To calculate the
distance, it first creates a distance matrix, where each element in the
matrix is a cumulative distance of the minimum of three surrounding
neighbors. Suppose we have two time series, a sequence
$Q = (q_1, \ldots, q_i, \ldots, q_n)$ and a sequence
$C = (c_1, \ldots, c_j, \ldots, c_m)$. First, we create an n-by-m matrix,
and then each element, $\gamma_{i,j}$, of the matrix is defined as:

$$\gamma_{i,j} = |q_i - c_j|^p + \min\{\gamma_{i-1,j-1},\ \gamma_{i-1,j},\ \gamma_{i,j-1}\} \tag{1}$$

where $\gamma_{i,j}$ is the summation of $|q_i - c_j|^p$ and the minimum
cumulative distance of three elements surrounding the element
$\gamma_{i,j}$, and p is the dimension of Lp-norms. For time series domain,
p = 2, corresponding to Euclidean distance, is typically used. After we
have all distance elements in the matrix, to find an optimal path, we
choose the path $W = \{w_1, \ldots, w_k, \ldots, w_K\}$ that yields a
minimum cumulative distance at (n, m), where $w_k$ is the position at the
k-th element of a warping path, $w_1 = (1, 1)$, and $w_K = (n, m)$, which
is defined as:

$$DTW(Q, C) = \min_{W \in \mathbb{W}} \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{2}$$

where $d_{w_k}$ is the Lp distance at the position $w_k$, p is the
dimension of Lp-norms in Equation 1, and $\mathbb{W}$ is a set of all
possible warping paths. The recursive functions are shown in Equations 3
and 4. Note that, in the original DTW, the p-th root of the distance must
be computed; however, for fast computation, we usually omit this
calculation since ranking of distance values does not change.

$$DTW(Q, C) = \sqrt[p]{D(n, m)} \tag{3}$$

$$D(i, j) = |q_i - c_j|^p + \min\{D(i-1, j-1),\ D(i-1, j),\ D(i, j-1)\} \tag{4}$$

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$, and
$1 \le j \le m$.
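
The recursion in Equations 3 and 4 can be sketched in a few lines of
Python. This is a minimal, unoptimized illustration that fills the full
cumulative matrix; the function name `dtw` and the default `p=2` are
assumptions made here for the example.

```python
import math

def dtw(q, c, p=2):
    """DTW distance between sequences q and c, following Equations 3 and 4."""
    n, m = len(q), len(c)
    # One extra row and column hold the boundary conditions
    # D(0,0) = 0 and D(i,0) = D(0,j) = infinity.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - c[j - 1]) ** p
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Take the p-th root as in Equation 3; it can be skipped when only the
    # ranking of distances matters.
    return D[n][m] ** (1.0 / p)
```

For instance, `dtw([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated
value can be absorbed by warping, whereas a one-to-one comparison is not
even defined for sequences of different lengths.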



3.2 Global Constraints 

Although unconstrained DTW distance measure gives an optimal distance
between two time series data, an unwanted warping path may be generated.
The global constraint efficiently limits the optimal path to give a more
suitable alignment. Recently, an R-K band (Ratanamahatana and Keogh, 2004),
a general model of global constraints, has been proposed. It can be
specified by a one-dimensional array R, i.e.,
$R = (r_1, \ldots, r_i, \ldots, r_n)$, where n is the length of time
series, and $r_i$ is the height above the diagonal in y direction and the
width to the right of the diagonal in x direction, as shown in Figure 5.
Each $r_i$ value is arbitrary; therefore, R-K band is also an
arbitrary-shaped global constraint. Note that when $r_i = 0$, where
$1 \le i \le n$, this R-K band represents the well-known Euclidean
distance, and when $r_i = n$, $1 \le i \le n$, this R-K band represents the
original DTW distance with no global constraint. The R-K band can also
represent the S-C band by giving all $r_i = c$, where c is the width of a
global constraint.
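
To make the role of the array R concrete, the sketch below restricts the
DTW recursion to cells inside a band. It assumes equal-length sequences and
one particular reading of the band definition, namely that cell (i, j) is
admissible when |i - j| does not exceed the band value at the diagonal
point min(i, j); the function name `dtw_rk` is hypothetical.

```python
import math

def dtw_rk(q, c, r, p=2):
    """DTW restricted to a band given by the array r (assumed reading)."""
    n = len(q)  # assumes len(q) == len(c) == len(r)
    D = [[math.inf] * (n + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            # Assumption: the band width is looked up at the diagonal point
            # min(i, j); cells outside the band keep an infinite cost.
            if abs(i - j) > r[min(i, j) - 1]:
                continue
            cost = abs(q[i - 1] - c[j - 1]) ** p
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][n] ** (1.0 / p)
```

With all band values set to 0 only the diagonal remains, so the result is
the Euclidean distance; with all values set to n no cell is excluded and
the result is the unconstrained DTW distance, matching the two special
cases noted above.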




Fig. 5 Global constraint on DTW distance matrix when applying specific R-K band



3.3 Lower Bounding Distance Function 

Lower bounding distance function for DTW distance is a function that is
used to calculate a lower bounding distance which must always be smaller
than or equal to the exact DTW distance (Yi et al., 1998; Kim et al., 2001;
Keogh and Ratanamahatana, 2005; Zhu and Shasha, 2003; Sakurai et al., 2005).
Therefore, in similarity search, the lower bounding function is used to
prune off the candidate sequences that are definitely not the answers.
Typically, lower bounding function consumes much lower computational time
than the DTW distance does. In this work, we consider two lower bounding
functions, LB_Keogh (Keogh and Ratanamahatana, 2005) proposed by Keogh et
al. and LBS (Sakurai et al., 2005) proposed by Sakurai et al. since
LB_Keogh is the best existing lower bounding function used in sequential
search, and LBS is the tightest lower bounding function used in indexing.
LB_Keogh creates an envelope from a query sequence, and then the lower
bounding distance is calculated from areas between the envelope and a
candidate sequence. Unlike LB_Keogh, LBS creates a segmented query sequence
and a segmented candidate sequence, and then these two segmented sequences
are used to determine a lower bounding distance using dynamic programming.
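
The pruning role of a lower bounding function in a sequential scan can be
sketched as follows. The sketch is generic: `lower_bound` and `distance`
are caller-supplied functions (for example LB_Keogh and DTW), and the name
`pruned_scan` as well as the returned counter are assumptions made for
illustration only.

```python
import math

def pruned_scan(query, candidates, lower_bound, distance):
    """Nearest-neighbor scan that skips candidates via a lower bound.

    Returns (index of best candidate, its distance, number of exact
    distance computations). It is only exact if
    lower_bound(q, c) <= distance(q, c) holds for every pair.
    """
    best_dist, best_idx, computed = math.inf, None, 0
    for idx, cand in enumerate(candidates):
        # A candidate whose lower bound already exceeds the best-so-far
        # distance cannot be the answer, so the exact distance is skipped.
        if lower_bound(query, cand) > best_dist:
            continue
        computed += 1
        d = distance(query, cand)
        if d < best_dist:
            best_dist, best_idx = d, idx
    return best_idx, best_dist, computed
```

The tighter the lower bound, the more often the expensive distance call is
skipped; the cheaper the lower bound, the less overhead each skipped
candidate costs. This is the trade-off between LB_Keogh and LBS mentioned
above.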
3.3.1 LB_Keogh

To calculate LB_Keogh (Keogh and Ratanamahatana, 2005), an envelope
$E = (e_1, \ldots, e_i, \ldots, e_n)$ is generated from a query sequence
$Q = (q_1, \ldots, q_i, \ldots, q_n)$, where $e_i = \{u_i, l_i\}$, and
$u_i$ and $l_i$ are the upper and lower values of $e_i$. With a specified
global constraint $R = (r_1, \ldots, r_i, \ldots, r_n)$, elements $u_i$ and
$l_i$ are computed from $u_i = \max\{q_{i-r_i}, \ldots, q_{i+r_i}\}$ and
$l_i = \min\{q_{i-r_i}, \ldots, q_{i+r_i}\}$, respectively. The lower
bounding distance $LB_{Keogh}(Q, C)$ between sequences Q and C can be
computed by the following equation.

$$LB_{Keogh}(Q, C) = \sqrt[p]{\sum_{i=1}^{n} \begin{cases} |c_i - u_i|^p & \text{if } c_i > u_i \\ |l_i - c_i|^p & \text{if } c_i < l_i \\ 0 & \text{otherwise} \end{cases}} \tag{5}$$

where p is the dimension of Lp-norms. The proof of
$LB_{Keogh}(Q, C) \le DTW(Q, C)$ can be found in the original paper
(Keogh and Ratanamahatana, 2005).
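
A minimal Python sketch of LB_Keogh is given below. It builds the envelope
of the query on the fly and accumulates the parts of the candidate that lie
outside it. Clipping the window at the ends of the sequence and the
function name `lb_keogh` are choices made here for illustration.

```python
def lb_keogh(q, c, r, p=2):
    """LB_Keogh between query q and candidate c under band array r."""
    n = len(q)  # assumes len(q) == len(c) == len(r)
    total = 0.0
    for i in range(n):
        # Envelope of the query over the window [i - r_i, i + r_i],
        # clipped to the valid index range.
        lo, hi = max(0, i - r[i]), min(n - 1, i + r[i])
        window = q[lo:hi + 1]
        u_i, l_i = max(window), min(window)
        if c[i] > u_i:       # candidate above the envelope
            total += (c[i] - u_i) ** p
        elif c[i] < l_i:     # candidate below the envelope
            total += (l_i - c[i]) ** p
    return total ** (1.0 / p)
```

The cost is linear in the sequence length for a fixed band width, compared
with the quadratic cost of filling the DTW matrix, which is why LB_Keogh is
attractive as a filter in sequential scan.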



3.3.2 LBS 



To calculate LBS (Lower bounding distance measure with Segmentation), a
query and a candidate sequence must first be segmented. The segmented
sequence $S^T = (s^T_1, \ldots, s^T_b, \ldots, s^T_B)$ is calculated from
the sequence $S = (s_1, \ldots, s_a, \ldots, s_A)$ with a given segment
size T, where $s^T_b = \{us^T_b, ls^T_b\}$,
$us^T_b = \max\{s_x, \ldots, s_y\}$, $ls^T_b = \min\{s_x, \ldots, s_y\}$,
$x = (b-1) \cdot T + 1$, $y = b \cdot T$, and $1 \le T \le A$. Although LBS
has capability to support segments with different lengths, in this work, we
consider segments with an equal length to demonstrate maximum performance
of LBS. The lower bounding distance $LBS(Q^T, C^T)$ between a segmented
query sequence $Q^T = (q^T_1, \ldots, q^T_i, \ldots, q^T_n)$ and a
segmented candidate sequence
$C^T = (c^T_1, \ldots, c^T_j, \ldots, c^T_m)$ can be computed by the
following equations.

$$LBS(Q^T, C^T) = \sqrt[p]{D(n, m)} \tag{6}$$

$$D(i, j) = T \cdot d(q^T_i, c^T_j) + \min\{D(i-1, j-1),\ D(i-1, j),\ D(i, j-1)\} \tag{7}$$

$$d(q^T_i, c^T_j) = \begin{cases} |lq^T_i - uc^T_j|^p & \text{if } lq^T_i > uc^T_j \\ |lc^T_j - uq^T_i|^p & \text{if } lc^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$,
$1 \le j \le m$, $q^T_i = \{uq^T_i, lq^T_i\}$,
$c^T_j = \{uc^T_j, lc^T_j\}$, and p is the dimension of Lp-norms; here n
and m denote the numbers of segments. The proof of
$LBS(Q^T, C^T) \le DTW(Q, C)$ can be found in Sakurai et al.'s original
paper (Sakurai et al., 2005).
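
Equations 6 to 8 can be sketched in Python as below. The sketch assumes
equal-length segments and a sequence length that is a multiple of the
segment size T; the helper names `segment`, `seg_dist`, and `lbs` are
introduced here for illustration and are not taken from the original
implementation.

```python
import math

def segment(s, T):
    """Split s into segments of size T, each stored as (upper, lower)."""
    return [(max(s[k:k + T]), min(s[k:k + T])) for k in range(0, len(s), T)]

def seg_dist(a, b, p=2):
    """Gap between two value ranges (Equation 8); zero if they overlap."""
    (ua, la), (ub, lb) = a, b
    if la > ub:
        return (la - ub) ** p
    if lb > ua:
        return (lb - ua) ** p
    return 0.0

def lbs(q, c, T, p=2):
    """LBS between q and c with segment size T (Equations 6 and 7)."""
    qs, cs = segment(q, T), segment(c, T)
    n, m = len(qs), len(cs)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each coarse cell is weighted by the segment size T.
            cost = T * seg_dist(qs[i - 1], cs[j - 1], p)
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m] ** (1.0 / p)
```

The dynamic programming table shrinks by a factor of T in each dimension,
which is where the saving over exact DTW comes from; a smaller T gives a
tighter bound at a higher cost.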






4 Time Warping in Indexed Sequential sTructure (TWIST) 

In this work, we propose a novel index structure called TWIST (Time Warping
in Indexed Sequential sTructure) which consists of both sequential structures
and an index structure. Each sequential structure stores a collection of raw
time series sequences, and the index structure stores a representative and a
pointer to its corresponding sequential structure. The intuitive idea of
TWIST is to minimize both the number of random accesses and the number of
distance calculations, making TWIST a much more suitable choice for massive
databases than the existing methods, which do not scale well.



4.1 Problem Definition 

We are interested in a generic top-k querying in this work since many other
mining tasks, e.g., classification and clustering, all require this
best-matched querying as their typical subroutine. Given a query sequence Q,
a set C of equal-length time series sequences, a global constraint R, and an
integer k, it returns a set of k nearest-neighbor sequences of Q from C
under DTW distance measure with the constraint R.



4.2 Data Structure 

In this section, we describe the data structure of TWIST which is specially
designed to minimize both the I/O and CPU costs in the querying process.
TWIST consists of two main components, i.e., a set of sequential structures
(called Data Sequence File - DSF) and an index structure (called Envelope
Sequence File - ESF). In addition, TWIST groups similar sequences into the
same sequential structure so that in the querying process, if this
sequential structure greatly differs from a query sequence, TWIST will
simply bypass that structure. To measure the difference between a query
sequence and all the sequences in a sequential structure, a representative
sequence (called an envelope) is pre-determined and stored in an index
structure. The main benefit of the sequential structure is that we can
access all the data in the sequential structure much faster than with
random access (Weber et al., 1998). A sample data structure of TWIST is
shown in Figure 6.

Suppose there is a set $\mathbb{S}$ of time series sequences
$S = (s_1, \ldots, s_i, \ldots, s_n)$; DSF simply stores these sequences
sequentially. And for each DSF, an envelope
$EG = (eg_1, \ldots, eg_i, \ldots, eg_n)$ for a group of time series
sequences is generated, where $eg_i = \{ueg_i, leg_i\}$,
$ueg_i = \max_{S \in \mathbb{S}}\{s_i\}$, and
$leg_i = \min_{S \in \mathbb{S}}\{s_i\}$. In addition, the data structure
of ESF is basically an array A of an object $O = \{P, EG\}$ containing a
pointer P to DSF and an envelope EG. Figure 7 illustrates an envelope
construction for each DSF. The envelope is determined from an upper bound
and a lower bound of a group of sequences.
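
The envelope construction can be sketched as follows. In this in-memory
illustration each DSF is simply a Python list of equal-length sequences and
the pointer P is its position in a list of DSFs; an on-disk layout, as well
as the function names `group_envelope` and `build_esf`, are assumptions
made for the example.

```python
def group_envelope(sequences):
    """Point-wise upper and lower bounds of a group of sequences."""
    upper = [max(column) for column in zip(*sequences)]
    lower = [min(column) for column in zip(*sequences)]
    return upper, lower

def build_esf(dsfs):
    """Build the index: one (pointer, envelope) entry per DSF.

    Here the pointer is the position of the DSF in `dsfs`; in a disk-based
    setting it would identify the file holding the group (assumption).
    """
    return [(pointer, group_envelope(group))
            for pointer, group in enumerate(dsfs)]
```

For example, `group_envelope([[1, 5, 2], [3, 0, 2]])` yields the upper
sequence `[3, 5, 2]` and the lower sequence `[1, 0, 2]`. The index holds one
envelope per group rather than one entry per sequence, which keeps it small
relative to the raw data.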







Fig. 7 An envelope created from a group of sequences 



4.3 Lower Bounding Distance for a Group of Sequences 

In this work, we propose a novel lower bounding distance function for a group
of sequences called LBG. Instead of calculating lower bounding distances
between a query sequence and a candidate sequence, LBG returns a lower
bounding distance between a query sequence and a set of candidate sequences;
in other words, the DTW distance between a query sequence and any candidate
sequence in the set is never smaller than the lower bounding distance from
LBG. Therefore, if the lower bounding distance is larger than the
best-so-far distance, LBG can prune off all those candidate sequences since
all the real DTW distances from the candidate sequences are guaranteed not
to be any smaller. More specifically, TWIST utilizes LBG by determining an
LBG for each DSF from an envelope sequence stored in the ESF so that only
some DSFs are accessed, which significantly reduces both CPU and I/O costs.

Given a query sequence $Q = (q_1, \ldots, q_a, \ldots, q_n)$ and an
envelope $EG = (eg_1, \ldots, eg_b, \ldots, eg_n)$, where
$eg_b = \{ueg_b, leg_b\}$, LBG first creates a segmented query sequence
$Q^T = (q^T_1, \ldots, q^T_i, \ldots)$ and a segmented envelope
$EG^T = (eg^T_1, \ldots, eg^T_j, \ldots)$ with segment size T, where
$q^T_i = \{uq^T_i, lq^T_i\}$ and $eg^T_j = \{ueg^T_j, leg^T_j\}$. An
element $q^T_i$ of the segmented query sequence is computed by
$uq^T_i = \max\{q_x, \ldots, q_y\}$ and
$lq^T_i = \min\{q_x, \ldots, q_y\}$, where $x = (i-1) \cdot T + 1$ and
$y = i \cdot T$. On the other hand, to segment an envelope EG, elements
$ueg^T_j$ and $leg^T_j$ are created as follows:
$ueg^T_j = \max\{ueg_x, \ldots, ueg_y\}$ and
$leg^T_j = \min\{leg_x, \ldots, leg_y\}$, where $x = (j-1) \cdot T + 1$ and
$y = j \cdot T$. To be more illustrative, Figure 8 shows the segmented
envelope $EG^T$ created from an envelope EG.






Fig. 8 Illustration shows a) an envelope used to generate b) a segmented envelope when 
calculating LBG 



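The lower bounding distance $LBG(Q^T, EG^T)$ between a segmented query
sequence $Q^T$ and a segmented envelope $EG^T$ can be computed by the
following equations.

$$LBG(Q^T, EG^T) = \sqrt[p]{D(n, m)} \tag{9}$$

$$D(i, j) = T \cdot d(q^T_i, eg^T_j) + \min\{D(i-1, j-1),\ D(i-1, j),\ D(i, j-1)\} \tag{10}$$

$$d(q^T_i, eg^T_j) = \begin{cases} |lq^T_i - ueg^T_j|^p & \text{if } lq^T_i > ueg^T_j \\ |leg^T_j - uq^T_i|^p & \text{if } leg^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \tag{11}$$

where $D(0, 0) = 0$, $D(i, 0) = D(0, j) = \infty$, $1 \le i \le n$,
$1 \le j \le m$, and p is the dimension of Lp-norms; as in LBS, n and m
here denote the numbers of segments.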
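
A minimal Python sketch of LBG following Equations 9 to 11 is shown below.
It assumes that the envelope is given as two equal-length lists `upper` and
`lower`, that the sequence length is a multiple of T, and that the query
and the envelope use the same segment size; the function name `lbg` is
introduced here for illustration.

```python
import math

def lbg(q, upper, lower, T, p=2):
    """Lower bound between query q and a group envelope (upper, lower)."""
    # Segment the query: (max, min) of each block of T points.
    qs = [(max(q[k:k + T]), min(q[k:k + T])) for k in range(0, len(q), T)]
    # Segment the envelope: max of the upper part, min of the lower part.
    es = [(max(upper[k:k + T]), min(lower[k:k + T]))
          for k in range(0, len(upper), T)]
    n, m = len(qs), len(es)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        uq, lq = qs[i - 1]
        for j in range(1, m + 1):
            ueg, leg = es[j - 1]
            if lq > ueg:          # query segment entirely above the envelope
                d = (lq - ueg) ** p
            elif leg > uq:        # query segment entirely below the envelope
                d = (leg - uq) ** p
            else:
                d = 0.0
            D[i][j] = T * d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m] ** (1.0 / p)
```

One call bounds the distance to every sequence stored under the envelope,
so a single comparison against the best-so-far distance can rule out an
entire DSF.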



Theorem 1 Let $Q^T = (q^T_1, \ldots, q^T_i, \ldots)$ and
$EG^T = (eg^T_1, \ldots, eg^T_j, \ldots)$ be the approximate segments of
sequence Q and envelope EG of a group of time series sequences
$\mathbb{C} = \{C_1, \ldots, C_k, \ldots, C_m\}$, respectively, where
$q^T_i = \{uq^T_i, lq^T_i\}$ and $eg^T_j = \{ueg^T_j, leg^T_j\}$, then

$$LBG(Q^T, EG^T) \le DTW(Q, C_{opt}) \tag{12}$$

where $C_{opt}$ is a sequence in $\mathbb{C}$ which gives minimum distance
to sequence Q, and $C^T_{opt}$ is a segmented sequence of $C_{opt}$.
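Proof Following from the proof of LBS (Sakurai et al., 2005), we have

$$LBS(Q^T, C^T_{opt}) \le DTW(Q, C_{opt}) \tag{13}$$

Because the envelope encloses every sequence in the group,
$ueg^T_j \ge uc^T_j$ and $leg^T_j \le lc^T_j$, where
$c^T_j = \{uc^T_j, lc^T_j\}$ is the j-th segment of $C^T_{opt}$. Hence

$$d(q^T_i, eg^T_j) = \begin{cases} |lq^T_i - ueg^T_j|^p & \text{if } lq^T_i > ueg^T_j \\ |leg^T_j - uq^T_i|^p & \text{if } leg^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} \le \begin{cases} |lq^T_i - uc^T_j|^p & \text{if } lq^T_i > uc^T_j \\ |lc^T_j - uq^T_i|^p & \text{if } lc^T_j > uq^T_i \\ 0 & \text{otherwise} \end{cases} = d(q^T_i, c^T_j) \tag{14}$$

Since $d(q^T_i, eg^T_j) \le d(q^T_i, c^T_j)$, then

$$LBG(Q^T, EG^T) \le LBS(Q^T, C^T_{opt}) \tag{15}$$

$$LBG(Q^T, EG^T) \le DTW(Q, C_{opt}) \tag{16}$$

Q.E.D.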



Since LBG utilizes the concept of a lower bounding distance calculation
between a query and a group of sequences, we also propose a lower bounding
distance function extended from LB_Keogh called $LBG_K$. $LBG_K$ obtains a
lower bounding distance from a query sequence
$Q = (q_1, \ldots, q_i, \ldots, q_n)$ and an envelope
$EG = (eg_1, \ldots, eg_i, \ldots, eg_n)$, where
$eg_i = \{ueg_i, leg_i\}$. Given a query sequence Q, an envelope EG, and a
global constraint $R = (r_1, \ldots, r_i, \ldots, r_n)$, $LBG_K$ first
creates an envelope of global constraint
$EGC = (egc_1, \ldots, egc_i, \ldots, egc_n)$ from EG, where
$egc_i = \{uegc_i, legc_i\}$. Elements $uegc_i$ and $legc_i$ are calculated
by $uegc_i = \max\{ueg_{i-r_i}, \ldots, ueg_{i+r_i}\}$ and
$legc_i = \min\{leg_{i-r_i}, \ldots, leg_{i+r_i}\}$, respectively. The
lower bounding distance $LBG_K(Q, EG)$ between the query sequence Q and the
envelope EG is determined by Equation 17, and its correctness is
established in Theorem 2.

$$LBG_K(Q, EG) = \sqrt[p]{\sum_{i=1}^{n} \begin{cases} |q_i - uegc_i|^p & \text{if } q_i > uegc_i \\ |legc_i - q_i|^p & \text{if } q_i < legc_i \\ 0 & \text{otherwise} \end{cases}} \tag{17}$$

where p is the dimension of Lp-norms.

Theorem 2 Let $Q = (q_1, \ldots, q_i, \ldots, q_n)$ be a query sequence and
$EGC = (egc_1, \ldots, egc_i, \ldots, egc_n)$ be an envelope of global
constraint created from an envelope EG of a group of sequences
$\mathbb{C} = \{C_1, \ldots, C_k, \ldots, C_m\}$, where
$C_k = (c_{k,1}, \ldots, c_{k,i}, \ldots, c_{k,n})$, then

$$LBG_K(Q, EG) \le DTW(Q, C_{opt}) \tag{18}$$

where $C_{opt}$ is the sequence which gives the minimum DTW distance to Q
in $\mathbb{C}$.



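Proof Since

$$DTW(Q, C_{opt}) = \sqrt[p]{\sum_{k=1}^{K} d_{w_k}} \tag{19}$$

where $d_{w_k}$ is the k-th distance calculation of sequence Q and the
nearest $C_{opt}$ in the optimal warping path, which calculates the
distance between $q_i$ and $c_{opt,j}$. For $uegc_i$ and $legc_i$,

$$uegc_i = \max_{i-r_i \le j \le i+r_i}\{ueg_j\} = \max_{i-r_i \le j \le i+r_i}\left\{\max_{1 \le k \le m}\{c_{k,j}\}\right\} \ge c_{opt,j} \tag{20}$$

$$legc_i = \min_{i-r_i \le j \le i+r_i}\{leg_j\} = \min_{i-r_i \le j \le i+r_i}\left\{\min_{1 \le k \le m}\{c_{k,j}\}\right\} \le c_{opt,j} \tag{21}$$

To establish $LBG_K(Q, EG) \le DTW(Q, C_{opt})$, it suffices to show that

$$\sum_{i=1}^{n} \begin{cases} |q_i - uegc_i|^p & \text{if } q_i > uegc_i \\ |legc_i - q_i|^p & \text{if } q_i < legc_i \\ 0 & \text{otherwise} \end{cases} \le \sum_{k=1}^{K} d_{w_k} \tag{22}$$

Since $K \ge n$ from the DTW's conditions, there are three possible cases,
i.e., $|q_i - uegc_i|^p \le d_{w_k}$, $|legc_i - q_i|^p \le d_{w_k}$, and
$0 \le d_{w_k}$. Suppose

$$|q_i - uegc_i|^p \le d_{w_k} \tag{23}$$

DTW requires that, for $d_{w_k}$ and for all $i - r_i \le j \le i + r_i$,
each data point $q_i$ must be compared once with $c_{opt,j}$, so
Equation 23 reads

$$|q_i - uegc_i|^p \le |q_i - c_{opt,j}|^p \tag{24}$$

which, for $q_i > uegc_i$, holds because

$$uegc_i \ge c_{opt,j} \tag{25}$$

by Equation 20. The case $|legc_i - q_i|^p \le d_{w_k}$ yields to a similar
argument, and $0 \le d_{w_k}$ always holds since $d_{w_k}$ is nonnegative.
Hence,

$$LBG_K(Q, EG) \le DTW(Q, C_{opt}) \tag{26}$$

Q.E.D.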
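
The bound of Equation 17 can be sketched in Python as follows. The envelope
of the group is first widened according to the global constraint and then
compared point by point with the query. Representing the envelope as two
lists `upper` and `lower`, clipping the window at the sequence ends, and
the function name `lbg_k` are assumptions made for this illustration.

```python
def lbg_k(q, upper, lower, r, p=2):
    """LB_Keogh-style lower bound between query q and a group envelope."""
    n = len(q)  # assumes len(q) == len(upper) == len(lower) == len(r)
    total = 0.0
    for i in range(n):
        # Widen the group envelope over the window [i - r_i, i + r_i].
        lo, hi = max(0, i - r[i]), min(n - 1, i + r[i])
        uegc = max(upper[lo:hi + 1])
        legc = min(lower[lo:hi + 1])
        if q[i] > uegc:      # query point above the widened envelope
            total += (q[i] - uegc) ** p
        elif q[i] < legc:    # query point below the widened envelope
            total += (legc - q[i]) ** p
    return total ** (1.0 / p)
```

Unlike LBG, this bound needs no dynamic programming, so it is linear in the
sequence length; in exchange it depends on the band width, and a wide band
widens the envelope and loosens the bound.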

4.4 Querying Process 

When a query sequence is issued, ESF is first accessed and lower bounding 
distance from LBG for each envelope is calculated. Therefore, if LBG for any 
DSF is larger than the best-so-far distance, all time series sequences in that 
DSF are guaranteed not to be the answers. TWIST could utilize this distance 
to prune off a significantly large number of candidate sequences by using only 
a very small amount of both CPU and I/O costs. 

Instead of calculating only one level of lower bounding distance, LBG
calculates lower bounding distance iteratively. First, the best-so-far
distance is initialized with an LBG distance between the coarsest segmented
sequences of a query sequence and an envelope. Subsequently, each finer
envelope sequence is used by LBG calculation again and again. If LBG
distance is still smaller than the best-so-far distance, the DSF is
accessed, and all data sequences in DSF are then sequentially searched. But
if finer LBG is returned with anything larger than the best-so-far
distance, the next DSF is then considered. The process is terminated when
all envelope sequences in ESF are exhausted. The pseudo code of TWIST with
LBG is described in Table 1.

Although implementations of LBG and $LBG_K$ over TWIST are different,
we provide solutions for both. The advantages of $LBG_K$ over LBG are that
$LBG_K$ needs to access ESF only once, while LBG needs to access it twice,
and that $LBG_K$ is faster when a small global constraint is applied in
the querying. However, LBG achieves a better query performance in terms of
query processing time than $LBG_K$ since LBG returns a tighter lower
bounding distance, independent of the global constraint.

To query with LBGk under top-k querying, each envelope sequence is
sequentially retrieved, and its lower bounding distance is calculated. The
LBGk distances are then inserted into a priority queue. The DSF with the
smallest LBGk distance is accessed first. Then, for each candidate sequence in
the DSF, sequential search is utilized to find the best-so-far sequence. Once
the DSF access is completed, the LBGk lower bounding distance of the next DSF
is considered. If the lower bounding distance of the envelope of the next DSF
is larger than the best-so-far distance, the search is terminated, and a set of
nearest-neighbor sequences is returned. The pseudo code is provided in
Table 2.
Table 1 Top-k querying under TWIST with LBG

Algorithm [C] = LBG Top-k Querying [Q, k]

Let:
    C be a priority queue of answer sequences
    P be a pointer to DSF
    EG be an envelope
    d_best = PositiveInfinite be the best-so-far distance
    T be the coarsest resolution
for all {P, EG} in ESF    // Finding d_best from the coarsest version of ESF
    d_EG = LBG_LBS(Q^T, EG^T)
    if (d_EG < d_best) d_best = d_EG endif
endfor
for all {P, EG} in ESF
    while (T is not the finest resolution)    // Use LBG_LBS to prune ESF
        d_EG = LBG_LBS(Q^T, EG^T)
        if (d_EG > d_best) Break and go to the next {P, EG} endif
        Set T to be a finer resolution
    endwhile
    for all C in DSF_P
        d_lower = LB(Q, C)
        if (d_lower < d_best)
            d_true = DTW(Q, C)
            if (C.size() < k)
                C.enqueue({C, d_true})
            else
                if (d_true < d_best)
                    C.enqueue({C, d_true})
                    C.dequeue()
                    d_best = C.peek().d_true
                endif
            endif
        endif
    endfor
endfor
Return C
Table 2 Top-k querying under TWIST with LBGk

Algorithm [C] = LBGk Top-k Querying [Q, k]

Let:
    W be a priority queue of envelope distances
    C be a priority queue of answer sequences
    P be a pointer to DSF
    EG be an envelope
    d_best = PositiveInfinite be the best-so-far distance
Initialize d_best = PositiveInfinite
for all {P, EG} in ESF    // Calculate LBG distance from ESF for all DSF
    d_EG = LBG_K(Q, EG)
    W.enqueue({P, d_EG})
endfor
// Dequeue {P, d_EG} with smallest d_EG
// Keep searching for an answer while d_EG < d_best
while ({P, d_EG} = W.dequeue() and d_EG < d_best)
    for all C in DSF_P
        d_lower = LB(Q, C)
        if (d_lower < d_best)
            d_true = DTW(Q, C)
            if (C.size() < k)
                C.enqueue({C, d_true})
            else
                if (d_true < d_best)
                    C.enqueue({C, d_true})
                    C.dequeue()
                    d_best = C.peek().d_true
                endif
            endif
        endif
    endfor
endwhile
Return C
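The control flow of top-k querying with a single group-level lower bound can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: Euclidean distance stands in for DTW, the distance from the query to the group envelope stands in for LBGk, the per-sequence bound LB(Q, C) is omitted, and the best-so-far distance is refreshed as soon as k answers are held. The names `top_k`, `dist`, and `group_lb` are hypothetical.

```python
import heapq
import math


def envelope(group):
    """Point-wise upper and lower envelope of a group of sequences."""
    return [max(col) for col in zip(*group)], [min(col) for col in zip(*group)]


def dist(q, c):
    """Stand-in for DTW(Q, C): plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def group_lb(q, env):
    """Stand-in for LBGk: distance from q to the group envelope."""
    ueg, leg = env
    return math.sqrt(sum(max(x - u, l - x, 0.0) ** 2
                         for x, u, l in zip(q, ueg, leg)))


def top_k(query, dsfs, k):
    """Return the k nearest sequences and the number of groups that were read."""
    # One pass over the envelopes: a lower bound per group, smallest first.
    order = [(group_lb(query, envelope(g)), p) for p, g in enumerate(dsfs)]
    heapq.heapify(order)
    answers = []            # max-heap through negated distances, at most k items
    d_best = math.inf
    groups_read = 0
    while order:
        d_eg, p = heapq.heappop(order)
        if d_eg >= d_best:
            break           # all remaining groups have bounds at least as large
        groups_read += 1
        for c in dsfs[p]:
            d_true = dist(query, c)
            if len(answers) < k:
                heapq.heappush(answers, (-d_true, c))
            elif d_true < -answers[0][0]:
                heapq.heapreplace(answers, (-d_true, c))
            if len(answers) == k:
                d_best = -answers[0][0]
    return sorted((-d, c) for d, c in answers), groups_read


dsfs = [
    [[0.0, 0.0, 0.0], [0.1, 0.0, -0.1]],
    [[5.0, 5.0, 5.0], [5.2, 4.9, 5.1]],
    [[-4.0, -4.0, -4.0], [-4.1, -3.9, -4.0]],
]
answers, groups_read = top_k([0.05, 0.0, 0.0], dsfs, k=1)
print(answers[0][1], groups_read)
```

In this toy example only the first group is read; the other two are skipped because their envelope bounds already exceed the best-so-far distance.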



Although this paper emphasizes top-k querying, it can easily be adapted to
range queries. Instead of using the best-so-far distance to prune the database,
the range distance is used to specify the maximum distance between a query
sequence and a candidate sequence. In addition, the integer k is set to
positive infinity.

4.5 Indexing Process 

To maintain the data structure, we also propose a mechanism to efficiently insert
and delete data sequences in our proposed index structure TWIST.

4.5.1 Data Sequence Insertion

To insert a new sequence when DSFs and an ESF already exist, the cost of
insertion between the new sequence and an envelope is computed for every
envelope in the ESF, and the new sequence is assigned to the envelope with the
minimum cost. After the minimum-cost envelope has been found, the envelope's
DSF is accessed, and the new sequence is added. The envelope is then updated
accordingly in the ESF. Generally, the cost is computed from the size of an
envelope after insertion. If the DSF exceeds the maximum number of sequences
per file (the maximum page size), TWIST splits this DSF into two DSFs, and two
new envelopes are generated and stored in the ESF. For clarification, we
provide the insertion algorithm in Table 3. Note that the maximum page size is
a user-defined parameter which determines the maximum number of sequences
within each DSF.

Table 3 Inserting a new sequence to TWIST 



Algorithm Insertion [C]

// Find the minimum-cost DSF
Initialize cost_min = PositiveInfinite, P_min = NULL
for all {P, EG_P} in ESF
    cost_EG = Cost(EG_P, C)
    if (cost_EG < cost_min)
        cost_min = cost_EG
        P_min = P
    endif
endfor
Add C in DSF_{P_min}
// Check if the size of DSF_{P_min} exceeds the maximum page size a
if (DSF_{P_min}.size() > a)
    // Split DSF_{P_min} into two DSFs, DSF_X and DSF_Y
    [{X, EG_X}, {Y, EG_Y}] = SplitDSF(DSF_{P_min})
    Delete {P_min, EG_{P_min}} from ESF
    Add {X, EG_X}, {Y, EG_Y} to ESF
else
    // Update EG_{P_min} from C
    EG_{P_min} = UpdateEnvelope(EG_{P_min}, C)
    Update {P_min, EG_{P_min}} to ESF
endif




Fig. 9 Shadowed area represents total cost of insertion between a sequence C and an 
envelope EG 
Table 4 Cost function for an insertion of a sequence C into EG

Algorithm Cost [EG, C]

Let:
    cost_sum = 0
for each c_i, ueg_i, leg_i
    if (c_i > ueg_i)
        cost_sum = cost_sum + |c_i - ueg_i|^p
    else if (c_i < leg_i)
        cost_sum = cost_sum + |leg_i - c_i|^p
    endif
endfor
Return cost_sum



Generally, the cost function measures how much the total area of an envelope
grows after a new sequence is inserted. To be more illustrative, the shadowed
areas in Figure 9 indicate the cost of insertion. Given a new time series
sequence $C = \langle c_1, \ldots, c_i, \ldots, c_n \rangle$ and an envelope
$EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$, where
$eg_i = \{ueg_i, leg_i\}$, the cost function $Cost(EG, C)$ is defined as
follows (also shown in Table 4):

$$Cost(EG, C) = \sum_{i=1}^{n} \begin{cases} |c_i - ueg_i|^p & \text{if } c_i > ueg_i \\ |leg_i - c_i|^p & \text{if } c_i < leg_i \\ 0 & \text{otherwise} \end{cases} \qquad (27)$$

where p is the dimension of Lp-norms.
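As a worked example of the cost function, the sketch below assumes that a point below the envelope is charged its distance to the lower envelope `leg_i` and that the cost is the plain sum without a p-th root. The function name `insertion_cost` and the sample values are illustrative.

```python
def insertion_cost(ueg, leg, c, p=2):
    """Sum of |c_i - envelope|^p over the points of c outside [leg_i, ueg_i]."""
    cost = 0.0
    for u, l, x in zip(ueg, leg, c):
        if x > u:            # above the upper envelope
            cost += (x - u) ** p
        elif x < l:          # below the lower envelope
            cost += (l - x) ** p
    return cost


ueg = [1.0, 2.0, 3.0, 2.0]
leg = [0.0, 1.0, 1.0, 0.0]
c = [3.0, 1.5, 0.0, -1.0]
print(insertion_cost(ueg, leg, c))        # 4 + 0 + 1 + 1 = 6.0
print(insertion_cost(ueg, leg, c, p=1))   # 2 + 0 + 1 + 1 = 4.0
```

A sequence that already lies inside the envelope has zero cost, so it is always placed in a group whose envelope does not need to grow, when such a group exists.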

If the number of sequences in a DSF exceeds the maximum page size, the
DSF needs to be split into two DSFs to reduce the envelope size. Generally,
TWIST tries to split sequences into two groups so that each new envelope
sequence is tight and has only small overlaps. In this paper, k-means
clustering (MacQueen, 1967) (k = 2) with Euclidean distance is adopted as a
heuristic function for separating the data into two appropriate groups.
Other algorithms, such as the splitting algorithms in R-tree (Guttman, 1984)
and R*-tree (Beckmann et al., 1990), can be used in place of the k-means
clustering algorithm since they are also designed to separate and minimize
Minimum Bounding Rectangles (MBRs); however, these splitting algorithms
have relatively large time complexity. Pseudo code of the splitting algorithm
is provided in Table 5.

After new DSFs are created in the insertion step, new envelopes are
generated by the algorithm described in Table 6, which finds the maximum and
minimum values for each DSF. If the number of sequences in the DSF does not
exceed the maximum allowed, the envelope in the ESF is simply updated using
the existing envelope and the new sequence. To update the existing envelope
$EG = \langle eg_1, \ldots, eg_i, \ldots, eg_n \rangle$ from a new time series
sequence $C = \langle c_1, \ldots, c_i, \ldots, c_n \rangle$, elements are
updated by $ueg_i = \max\{ueg_i, c_i\}$ and $leg_i = \min\{leg_i, c_i\}$, where
$eg_i = \{ueg_i, leg_i\}$. The updating algorithm is described in Table 7.
Table 5 Splitting algorithm, separating a DSF into two DSFs

Algorithm SplitDSF [DSF]

// Run k-means clustering algorithm
// Fix k = 2
[DSF_X, DSF_Y] = KMeans(DSF)
// Create EG_X and EG_Y
EG_X = CreateEnvelope(DSF_X)
EG_Y = CreateEnvelope(DSF_Y)
Return [{X, EG_X}, {Y, EG_Y}]
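A split along these lines can be sketched with a small, dependency-free 2-means loop. This is only an illustration: the random initialization, the iteration cap, and the names `split_dsf`, `sq_dist`, and `mean` are assumptions, and a production version would need to handle the degenerate case where one group ends up empty.

```python
import random


def mean(seqs):
    """Point-wise mean of a non-empty list of equal-length sequences."""
    return [sum(col) / len(seqs) for col in zip(*seqs)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def create_envelope(seqs):
    """Point-wise maximum and minimum over the sequences."""
    return [max(col) for col in zip(*seqs)], [min(col) for col in zip(*seqs)]


def split_dsf(dsf, iterations=20, seed=0):
    """Split a DSF into two groups with 2-means; return (group, envelope) pairs."""
    rng = random.Random(seed)
    centers = rng.sample(dsf, 2)          # two distinct positions as initial centers
    for _ in range(iterations):
        groups = ([], [])
        for s in dsf:
            nearer = 0 if sq_dist(s, centers[0]) <= sq_dist(s, centers[1]) else 1
            groups[nearer].append(s)
        if not groups[0] or not groups[1]:
            break                          # degenerate split; left as-is in this sketch
        new_centers = [mean(groups[0]), mean(groups[1])]
        if new_centers == centers:
            break                          # assignments are stable
        centers = new_centers
    return [(g, create_envelope(g)) for g in groups]


dsf = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.1], [9.0, 9.1], [9.2, 8.9], [9.1, 9.0]]
(x, eg_x), (y, eg_y) = split_dsf(dsf)
print(len(x), len(y))
```

After the split, each new envelope spans only one cluster, so both envelopes are far tighter than the envelope of the original DSF.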



Table 6 An envelope construction algorithm 



Algorithm CreateEnvelope [DSF]

Let:
    EG be an envelope
for each sequence C in DSF
    for each c_i, ueg_i, leg_i
        ueg_i = max{ueg_i, c_i}
        leg_i = min{leg_i, c_i}
    endfor
endfor
Return EG



Table 7 An envelope sequence update algorithm after a new sequence insertion 

Algorithm UpdateEnvelope [EG, C]

for each c_i, ueg_i, leg_i
    ueg_i = max{ueg_i, c_i}
    leg_i = min{leg_i, c_i}
endfor
Return EG
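The two envelope routines can be sketched together. The snippet below assumes an empty envelope is initialized to negative infinity (upper) and positive infinity (lower), which the tables leave implicit. It also shows that updating an existing envelope with one new sequence gives the same result as rebuilding it from scratch.

```python
import math


def update_envelope(eg, c):
    """Widen the envelope (ueg, leg) so that it also covers sequence c."""
    ueg, leg = eg
    return ([max(u, x) for u, x in zip(ueg, c)],
            [min(l, x) for l, x in zip(leg, c)])


def create_envelope(dsf):
    """Build the envelope of a non-empty DSF by folding in one sequence at a time."""
    n = len(dsf[0])
    eg = ([-math.inf] * n, [math.inf] * n)   # assumed initial state
    for c in dsf:
        eg = update_envelope(eg, c)
    return eg


dsf = [[1.0, 5.0, 2.0], [3.0, 0.0, 2.0]]
eg = create_envelope(dsf)
new = [4.0, 1.0, -1.0]
updated = update_envelope(eg, new)
print(updated == create_envelope(dsf + [new]))
```

The incremental update touches only the ESF entry, which is why insertion without a split does not need to re-read the sequences already stored in the DSF.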



4.5.2 Data Sequence Deletion

To delete a data sequence, the corresponding DSF is accessed and the sequence
is simply deleted. However, when a DSF changes, the ESF needs to be updated
as well. In particular, we provide two deletion policies, i.e., eager deletion
and lazy deletion. For eager deletion, after each sequence deletion, TWIST
immediately recalculates a new envelope from the entire set of sequences in
that DSF and writes the changes to the ESF. Lazy deletion, on the other hand,
simply deletes a sequence from the DSF without updating the ESF; this is safe
because the old envelope still encloses all remaining sequences, so TWIST
guarantees that false dismissals never occur in the lower bounding calculation
of LBG. The tradeoff between these two deletion policies is between the
deletion time and the tightness of the envelope. If eager deletion is applied,
the deletion time increases but the envelope sequence is tighter, while lazy
deletion is very fast but leaves an envelope sequence that is not as tight.
We provide pseudo code for the deletion algorithm in Table 8.






Table 8 Delete an existing sequence from TWIST 



Algorithm Deletion [C]

Select DSF_P which contains C
Delete C from DSF_P
if (IsEager)
    EG_P = CreateEnvelope(DSF_P)
    Update {P, EG_P} to ESF
endif
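The tradeoff between the two policies can be seen in a small example. This sketch makes simplifying assumptions: Euclidean distance stands in for DTW, and the envelope bound uses no warping window. The names `delete`, `envelope_lb`, and `euclid` are illustrative.

```python
import math


def create_envelope(dsf):
    """Point-wise maximum (ueg) and minimum (leg) over the sequences of a DSF."""
    return [max(col) for col in zip(*dsf)], [min(col) for col in zip(*dsf)]


def envelope_lb(q, env):
    """Distance from q to the envelope (Euclidean, no warping window)."""
    ueg, leg = env
    return math.sqrt(sum(max(x - u, l - x, 0.0) ** 2
                         for x, u, l in zip(q, ueg, leg)))


def euclid(q, c):
    """Stand-in for the true distance between q and c."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))


def delete(dsf, env, c, eager):
    """Remove c from the DSF; rebuild the envelope only under eager deletion."""
    dsf.remove(c)
    return create_envelope(dsf) if eager else env


dsf = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
env = create_envelope(dsf)
lazy_env = delete(list(dsf), env, [5.0, 5.0], eager=False)   # works on a copy
eager_env = delete(dsf, env, [5.0, 5.0], eager=True)

query = [3.0, 3.0]
lazy_lb = envelope_lb(query, lazy_env)
eager_lb = envelope_lb(query, eager_env)
nearest = min(euclid(query, c) for c in dsf)
print(lazy_lb, eager_lb, nearest)
```

Both bounds stay below the true nearest distance, so neither policy causes a false dismissal; the stale envelope kept by lazy deletion simply prunes less.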



5 Experimental Evaluation 



In the experimental evaluation, we compare our proposed method, TWIST, with
the best existing indexing method, FTW (Sakurai et al., 2005), and the best
naive method, sequential search with LB_Keogh (Keogh and Ratanamahatana,
2005), on several evaluation metrics, i.e., querying time, indexing time, the
number of page accesses, and storage requirement. In addition, two variants
of our proposed method are evaluated, i.e., TWIST with LBG and TWIST
with LBGk. Although FTW indexing outperforms R*-tree with LB_PAA
(Keogh and Ratanamahatana, 2005), our method shows superiority over FTW
by a few orders of magnitude. Sequential search with LB_Keogh is also
evaluated to show the best performance of a naive method when no indexing
structure is utilized. It is important to note that we make our best effort to
tune the rival methods to run at their best performance by applying the early
abandoning (Keogh and Ratanamahatana, 2005) and early stopping
(Sakurai et al., 2005) techniques; however, as will be demonstrated, our
proposed method still outperforms them in all terms.

To verify that our proposed method scales to massive time series
databases, we use databases whose size exceeds the main memory; otherwise,
the operating system is likely to cache the data in main memory.
Therefore, our database size ranges from 256 MB to 4 GB. We perform our
experiments on a Windows XP computer with an Intel Core 2 Duo 2.77 GHz
processor, 2 GB of RAM, and an 80 GB 5400 rpm internal hard drive. All code
in our experiments is implemented in Java 1.5.



5.1 Datasets 



To visualize the performance in various dimensions, the datasets listed
below are generated by varying the number of sequences in the database
($2^{16}$ = 65536, $2^{17}$ = 131072, $2^{18}$ = 262144, and $2^{19}$ = 524288
sequences) and the sequence length (512, 1024, and 2048 data points). All data
sequences are Z-normalized; some examples for each dataset are shown in
Figure 10.

1. Random Walk I (Sakurai et al., 2005; Assent et al., 2008): To demonstrate
the scalability of our proposed method, a large number of sequences are
generated by the following equation: $t_{i+1} = t_i + N(0, 1)$, where
$N(0, 1)$ is a random value drawn from a normal distribution.

2. Random Walk II (Assent et al., 2008): We generate a set of random walk
sequences from the following equation: $t_{i+1} = 2t_i - t_{i-1} + N(0, 1)$,
where $N(0, 1)$ is a random value drawn from a normal distribution.

3. Electrocardiogram (Moody and Mark, 1983): This dataset is recorded from
human subjects with atrial fibrillation at 250 samples per second.
In addition, this dataset was made at Boston's Beth Israel Hospital and
revised for MIT-BIH Arrhythmia Database. To build the dataset, we segment
all the original sequences into small subsequences.
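The two synthetic generators and the Z-normalization step can be sketched as follows. The starting values of zero and the function names are assumptions; the recurrences are the ones given in the list above.

```python
import math
import random


def random_walk_1(n, rng):
    """t[i+1] = t[i] + N(0, 1), started at 0 (assumed)."""
    t = [0.0]
    for _ in range(n - 1):
        t.append(t[-1] + rng.gauss(0, 1))
    return t


def random_walk_2(n, rng):
    """t[i+1] = 2 t[i] - t[i-1] + N(0, 1), started at 0, 0 (assumed)."""
    t = [0.0, 0.0]
    for _ in range(n - 2):
        t.append(2 * t[-1] - t[-2] + rng.gauss(0, 1))
    return t[:n]


def z_normalize(seq):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = sum(seq) / len(seq)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in seq) / len(seq))
    return [(x - mu) / sigma for x in seq]


rng = random.Random(0)
w1 = z_normalize(random_walk_1(512, rng))
w2 = z_normalize(random_walk_2(512, rng))
print(len(w1), len(w2))
```

Random Walk II adds noise to the slope rather than to the value, so its sequences are much smoother than those of Random Walk I.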



5.2 Querying Time 

In this experiment, query processing times are averaged over 100 runs and
are compared in the best-matched problem by varying five parameters, i.e.,
the dataset size (the number of time series sequences), the sequence length,
the width of the global constraint, the integer k, and the maximum page size
(only for TWIST). In order to observe the trend for each parameter, the
default values are fixed as follows: the dataset size is 524288 ($2^{19}$)
sequences, the length of each time series sequence is 2048 data points, the
width of the global constraint is 10% of the sequence length, the integer k in
top-k querying is 1, and the maximum number of sequences in a DSF is 128.
A dataset of 524288 sequences of length 2048 is approximately 4 GB in size,
and a global constraint width of 10% is typically used in the time series data
mining community (Ratanamahatana and Keogh, 2005). Note that for LBS,
the default segment sizes proposed in the original paper are used, i.e., 1024, 256,
Fig. 11 TWIST outperforms the rival methods, and is slightly affected by an increase in
the dataset size, where sequence length, global constraint, an integer k, and page size are
set to 2048, 10%, 1, and 128, respectively





Fig. 12 Although sequence length increases, TWIST requires only small query
processing time compared with FTW and LB_Keogh, where database size, global
constraint, an integer k, and page size are set to 524288, 10%, 1, and 128, respectively
Fig. 13 TWIST is faster than FTW and LB_Keogh for all values of k, where database
size, sequence length, global constraint size, and maximum page size are set to 524288, 2048,
10%, and 128, respectively




[Figure 14, panel (c) Electrocardiogram; x-axis: the width of global constraint (% of time series length)]



Fig. 14 TWIST and FTW are not affected by the increment of the global constraint's width;
however, TWIST outperforms both FTW and LB_Keogh, where database size, sequence
length, an integer k, and page size are set to 524288, 2048, 1, and 128, respectively
Fig. 15 When maximum page size changes, TWIST still outperforms the rival methods,
where database size, sequence length, global constraint size, and an integer k are set to
524288, 2048, 10%, and 1, respectively



64, and 16, and LBG uses the same segment sizes as LBS. In the sequential
search within a DSF, we implement LBS to reduce the number of DTW distance
calculations. However, the segmented sequences are generated online; in other
words, no index structure is stored in the DSF.

Figures 11, 12, 13, 14, and 15 illustrate the performance of TWIST in terms of
querying time against the two rival methods when varying the dataset size, the
sequence length, the integer k, the width of the global constraint, and the
maximum number of sequences in a DSF, respectively. As expected, TWIST greatly
outperforms sequential search with LB_Keogh and FTW indexing.



5.3 Indexing Time 

Indexing time is the wall-clock time that an algorithm consumes to build the
index structure. In this experiment, we only compare the indexing time with
FTW indexing since sequential search with LB_Keogh does not need an
index structure. From the experiment shown in Figure 16, our indexing time is
comparable to FTW's; however, if the maximum page size is larger, TWIST
can greatly reduce indexing time, though this may trade off against querying
time (see Figure 15). The parameters used in this experiment are set to the
default parameters of the previous experiments. Although the indexing time is
comparable to that of FTW indexing, TWIST requires very little storage space
compared with FTW indexing (as will be shown in Section 5.5).
Fig. 16 As page size increases, the indexing time of TWIST significantly reduces and is
comparable to FTW's. Note that TWIST still queries faster than FTW for all page sizes
(see Figure 15).

Fig. 17 Number of page accesses of TWIST is smaller than that of the rival methods,
especially in Random Walk I and Random Walk II, when speedup factor is 5.

Fig. 18 Number of page accesses of TWIST is smaller than that of the rival methods,
especially in Random Walk I and Random Walk II, when speedup factor is 10.



5.4 Number of Page Accesses 

The number of page accesses ($\eta$) is generally evaluated in order to estimate
the I/O cost. We calculate the number of page accesses for TWIST with LBG
and TWIST with LBGk according to the following equations.

$$\eta_{LBG} = \frac{2\alpha + \beta}{SF} + \delta \qquad (28)$$

$$\eta_{LBG_K} = \frac{\alpha + \beta}{SF} + \delta \qquad (29)$$

where $\alpha$ is the number of envelopes in the ESF, $\beta$ is the number of
accessed candidate sequences, $\delta$ is the number of random accesses to DSFs,
and $SF$ is the speedup factor proposed by Weber et al. (1998), who state that
sequential access is 5 to 10 times faster than random access. Generally, two
values of SF are considered, i.e., 5 and 10, which represent the traditional
and practical speedup factors of sequential access over random access.
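A small worked example may help to see how the speedup factor enters the count. The sketch assumes that sequential reads (envelopes and candidate sequences) are divided by SF while random DSF accesses are counted in full, and that LBG reads the ESF twice and LBGk once. The values of `delta` and `beta` below are hypothetical, not measured results.

```python
def page_accesses(alpha, beta, delta, sf, esf_passes):
    """Sequential reads are discounted by the speedup factor sf (assumed form)."""
    return (esf_passes * alpha + beta) / sf + delta


n_sequences = 524288                 # default dataset size in the experiments
page_size = 128                      # default maximum number of sequences per DSF
alpha = n_sequences // page_size     # 4096 envelopes if every DSF were full
delta = 10                           # hypothetical: 10 DSFs opened by one query
beta = delta * page_size             # hypothetical: every sequence in them is read

for sf in (5, 10):
    eta_lbg = page_accesses(alpha, beta, delta, sf, esf_passes=2)
    eta_lbgk = page_accesses(alpha, beta, delta, sf, esf_passes=1)
    scan = n_sequences / sf          # sequential scan reads the whole database
    print(sf, eta_lbg, eta_lbgk, scan)
```

Under these assumptions the cost is dominated by reading the ESF, which is small compared with a scan of the whole database.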

Since sequential scan accesses the entire database, it can be considered
an upper bound. Surprisingly, as shown in Figures 17 and 18, the
number of page accesses of FTW indexing is approximately equal to that of
sequential scan and is very large compared with that of our proposed method
TWIST, because FTW retrieves the entire index structure, which nearly doubles
the size of the database. On the other hand, in average cases, TWIST greatly
reduces the number of data accesses since it tries to minimize both the number
of DSF accesses and the number of accessed candidates. The experimental
parameters, i.e., dataset size, sequence length, maximum page size, global
constraint, and k, are set to 524288, 2048, 128, 10%, and 1, respectively.

5.5 Storage Requirement 

In this section, we compare the storage required for the index file with
that of the rival method, FTW. Since FTW creates a set of segmented sequences
for each candidate sequence, its index file is larger than the data file.
Therefore, the FTW index structure is not practical in real-world
applications. Unlike FTW, TWIST's index file requires only a small
amount of storage, i.e., only the envelopes of all groups of data sequences
are stored. Figure 19 shows the comparison of storage requirements between
TWIST and FTW. When the dataset size is $2^{19}$ sequences, or 4 GB, FTW
requires nearly 5 GB, but as expected TWIST requires only 110 MB; in other
words, TWIST requires approximately 51 times less storage space than FTW,
while still outperforming it in terms of query processing time.




Fig. 19 Illustration of storage requirement comparison showing that TWIST's index file
requires only small amount of storage when comparing with FTW's, where dataset size,
sequence length, and maximum page size are set to 524288, 2048, and 64, respectively.
Panels: (a) Random Walk I, (b) Random Walk II.



5.6 Discussion 

As expected, query processing time increases for all approaches when the
dataset size and the sequence length are larger. However, from Figures 11 and
12, we can see that FTW indexing and the naive method require much longer time
for a single query than TWIST with LBG and LBGk, and their query processing
time grows much larger as the database size increases. In Figure 14, when the
global constraint changes, only the naive method with LB_Keogh and TWIST
with LBGk are affected, since LB_Keogh and LBGk lose their tightness
as the width of the global constraint increases. Although best-matched
querying (k = 1) is typically used in several domains, we also evaluate TWIST
when varying k, as shown in Figure 13. As expected, when k increases, the query
processing time also increases, since a large value of k makes the best-so-far
distance large, and a large best-so-far distance weakens the pruning power of
the lower bounding distance. However, Figure 13 shows that TWIST still
retrieves answers more efficiently than the other methods. The maximum page
size is another important parameter that must be considered
because TWIST uses it to balance the number of pages in the database and
the number of sequences in each page. In other words, if the maximum page
size is small, the number of random accesses increases; otherwise, the number
of sequential accesses increases. However, in our experiments, TWIST still
outperforms FTW and sequential search with LB_Keogh as the maximum page size
changes. Note that when the maximum page size is set to one,
TWIST is identical to FTW, and when the maximum page size is set to infinity,
TWIST is similar to the naive method, i.e., sequential scan. Therefore,
both FTW indexing and the naive method are special cases of TWIST.

To evaluate the indexing time, we compare TWIST with FTW indexing by
varying the database size and the maximum page size in Figure 16. In our
insertion algorithm, if the number of sequences exceeds the maximum page
size, TWIST splits the DSF into two DSFs. Therefore, if the maximum page size
is large, TWIST makes fewer calls to the splitting function; this in turn
reduces the indexing time, since the splitting algorithm relies on the k-means
clustering algorithm, whose time complexity is linear in the page size.
Although a large maximum page size reduces the indexing time, it comes at
the cost of querying performance.

Although we evaluate query processing time in Section 5.2, the number of page
accesses also needs to be evaluated since it reflects the I/O cost of each
approach. The number of page accesses is formulated and calculated according
to (Sakurai et al., 2005; Weber et al., 1998), which state that sequential
access is five to ten times faster than random access. From Figures 17 and 18,
the number of page accesses of FTW indexing is always larger than that of the
naive approach since FTW indexing reads all segmented sequences in the index
file, whose number equals the number of sequences in the database. TWIST
consumes only a small number of page accesses because it is designed to reduce
both sequential and random accesses.

For the size of the index structure, TWIST uses only a small amount of
space compared with FTW indexing, which always requires space twice
the database size. In Figure 19, we demonstrate TWIST's storage requirement
by varying the database size and the maximum page size, since the size
of the ESF depends solely on the number of DSFs in the database.



6 Conclusion 

In this work, we propose a novel indexed sequential structure called TWIST
(Time Warping in Indexed Sequential sTructure) which significantly reduces
querying time, by up to 50 times compared with the best existing methods, i.e.,
FTW indexing and sequential scan with LB_Keogh. More specifically, TWIST
groups similar time series sequences together in the same file, and then a
representative of each group of sequences is calculated and stored in the index
structure. When a query sequence is issued, a lower bounding distance for a
group of sequences is determined from the query sequence and the
representative retrieved from the index file. Therefore, if the lower bounding
distance for a group of sequences is larger than the best-so-far distance, the
candidate sequences in the group do not need to be accessed. This can prune an
impressively large number of candidate sequences and makes TWIST feasible
for massive time series databases.